High throughput embedding generation system for executable code and applications

ABSTRACT

A novel high-throughput embedding generation and comparison system for executable code is presented in this invention. More specifically, the invention relates to a deep-neural-network based graph embedding generation and comparison system. A novel bi-directional code graph embedding generation method is proposed to enrich the information extracted from the code graph. Furthermore, by deploying matrix manipulation, the throughput of embedding generation is significantly increased. Potential applications, such as executable file similarity calculation and vulnerability search, are also presented in this invention.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Application No. 62/875,830, filed on Jul. 18, 2019.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not Applicable

THE NAMES OF THE PARTIES TO A JOINT RESEARCH AGREEMENT

Not applicable

INCORPORATION-BY-REFERENCE OF MATERIAL SUBMITTED ON A COMPACT DISC

Not applicable

BACKGROUND OF THE INVENTION

Given two binary functions, we would like to detect whether they are semantically equivalent or similar. This problem is known as “binary code similarity detection” or “binary code search”, and it has many security applications, such as plagiarism detection, malware detection, and vulnerability search. For example, binary code similarity detection can be applied to determine whether newly received binaries are variants of known malware.

In the cybersecurity industry, to process the huge volume of executable code (e.g., malware, firmware images, etc.), security practitioners face an increasing need to quickly detect similar functions directly in executable code for different purposes (e.g., malware classification, vulnerability search, etc.).

However, the existing binary code similarity detection approaches are far from scalable enough to handle the enormous amount of executable code in the wild. The normal workflow of code search is to first disassemble the binary code, then extract features from it, and finally compare the similarity between candidates. There have been many works that try to detect similar code in binary executables, from simple syntax-based solutions like n-gram, to control-flow graph based approaches like BinDiff, to the most expensive symbolic execution and theorem proving approaches like BinHunt. These methods all lack accuracy, and most of them are fairly expensive and do not satisfy the need to process large volumes of malware samples and to search over a large code base.

BRIEF SUMMARY OF THE INVENTION

As will be described in greater detail below, the instant disclosure generally relates to systems and methods for a high throughput embedding generation system for executable code and applications.

One of the features of the present invention is to provide a high-throughput system for embedding generation and comparison that can be used for potential applications such as plagiarism detection, malware detection, vulnerability search, etc.

Another feature of the present invention is to deploy a bi-directional graph embedding network in embedding generation.

Another feature of the present invention is to stack the BACFGs of functions to speed up the embedding generation process for the whole system.

Another feature of the present invention is to use matrix manipulation to speed up the intermediate process for similarity comparison, which allows high throughput for the whole system.

Another feature of the present invention is to apply PCA (Principal Component Analysis) on embeddings to calculate the similarity of executable code.

Another feature of the present invention is to combine high-throughput embedding generation and comparison with condition formula comparison to implement precise and scalable vulnerability search.

Features from any of the above-mentioned embodiments may be used in combination with one another in accordance with the general principles described herein. These and other embodiments, features, and advantages will be more fully understood upon reading the following detailed description in conjunction with the accompanying drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the general flow of the whole system according to various embodiments of the present invention.

FIG. 2 illustrates an example of disassembled raw code and bi-directional ACFG.

FIG. 3 illustrates the representations of bi-directional ACFG.

FIG. 4 illustrates the design of the high-throughput embedding generation embodiment.

FIG. 5 illustrates the approach of stacking embeddings.

FIG. 6 illustrates the process of calculating the embedding of executable files.

FIG. 7 illustrates the process of searching vulnerable functions in executable files.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Given two binary functions, we would like to detect whether they are similar. This problem is known as “binary code similarity detection” or “code search”, and it has many security applications, such as plagiarism detection, malware detection, vulnerability search, etc.

Vulnerability search is one of these applications and is becoming more critical than ever for discovering vulnerabilities in IoT devices. A single bug at the source code level may spread across hundreds or more devices that have diverse hardware architectures and software platforms. The study by Cui et al. showed that 80.4% of vendor-issued firmware is released with multiple known vulnerabilities, and many recently released firmware updates contain vulnerabilities in third-party libraries that have been known for over eight years.

Another application of code similarity detection is malware analysis. In particular, it can be used to classify malware into different malware families, which is one of the essential functionalities provided by antivirus software.

The normal workflow of code search is to first disassemble the binary code, then extract features from it, and finally compare the similarity between candidates. There have been many works that try to detect similar code in binary executables, from simple syntax-based solutions like n-gram, to control-flow graph based approaches like BinDiff, to the most expensive symbolic execution and theorem proving approaches like BinHunt.

Due to the huge volume of executable code (e.g., malware, firmware images, etc.), security practitioners face an increasing need to quickly detect similar functions directly in executable code for different purposes (e.g., malware classification, vulnerability search, etc.). In the present cybersecurity industry, the volume of binary programs analyzed by cybersecurity applications is huge (more than 900M, and more than 10M per month). Cybersecurity companies allocate a tremendous amount of computing resources to handle such a large volume of suspicious samples every day. However, the above-mentioned code similarity detection approaches are far from scalable enough to handle the enormous amount of executable code in the wild.

One promising approach to binary code similarity detection has been proposed recently. It learns high-level feature representations from the control flow graphs (in short, code graphs) and encodes the graphs into embeddings (i.e., high dimensional numerical vectors). It then uses the embeddings to compare the similarity of functions extracted from binary code. However, the generation and comparison of embeddings is not scalable enough to handle a large volume of code. To overcome this problem, a new high-throughput embedding generation and comparison system is presented herein. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention may be practiced without these specific details.

The present disclosure is to be considered as an exemplification of the invention, and is not intended to limit the invention to the specific embodiments illustrated by the figures or description below.

The present invention will now be described by referencing the appended figures representing the preferred embodiments. FIG. 1 depicts the overview of the invention. The system takes executable code 101 as input. The raw feature extraction embodiment 103 generates the Bi-Directional Attributed Control Flow Graph (BACFG) 104. The high-throughput embedding generation embodiment 105 takes the BACFG as input, and applies graph embedding to generate embeddings 106. Finally, the embeddings can be fed to the high-throughput similarity calculation embodiment 107 for various applications. This invention also presents two applications of the approach: executable file similarity comparison 108 and vulnerability search 109.

The input executable code 101 includes but is not limited to Java bytecode and binary executable code of various architectures (e.g., X86, MIPS, ARM, etc.), as long as BACFG 104 can be constructed from the input with proper tools.

The raw feature extraction embodiment 103 can be implemented in many ways. Raw features include but are not limited to Control Flow Graph, Attributed Control Flow Graph, etc. This invention presents one implementation of the raw feature: the Bi-Directional Attributed Control Flow Graph (BACFG) 104, defined as follows.

Definition 1. (Bi-directional Attributed Control Flow Graph) The bi-directional attributed control flow graph, or BACFG for short, is a directed graph with two edge sets, G=<V, E₁, E₂, φ>, where V is a set of basic blocks; E₁⊆V×V is a set of edges representing the connections between these basic blocks; E₂=E₁ ^(T)⊆V×V is a set of edges representing the reversed connections between these basic blocks; and φ: V→Σ is the labeling function which maps a basic block in V to a set of attributes in Σ.

Bi-directional ACFG extraction embodiment 102 can be implemented using different approaches. One approach relies on disassemblers such as IDA Pro and Binary Ninja to disassemble the executable code 101. Every function in the executable code is recovered and its raw features (control flow graph, basic block information) are extracted. Finally, BACFG 104 is constructed from this information for every function in the executable code 101. FIG. 2 presents an example of constructing BACFG 104 from executable code. The disassembled raw code 201 is extracted from a piece of OpenSSL executable code using IDA Pro, a commercial disassembler. It contains the control flow graph of function SSL_get_psk_identity_hint and basic block (n₁, n₂, n₃, n₄) information. 202 is the corresponding BACFG constructed for function SSL_get_psk_identity_hint. Every node in 201 is represented as a set of attributes. The edges in 201 are kept in the generated BACFG 202. The dotted arrow lines in 202 represent the reversed edges.

However, the generated BACFG 202 cannot be directly fed into a graph embedding network to generate the embedding. To solve this problem, FIG. 3 presents one approach to store BACFG 104. This approach uses three matrices to store the node, edge, and reversed edge information defined in BACFG 104. Every row in the node matrix represents one node in the BACFG. E.g., n₁ in 202 is the first row in node matrix 301. The edge and reversed edge information in 202 is represented using adjacency matrices 302 and 303. These three matrices can then be fed into graph embedding network 413 to generate the embedding. More specifically, the BACFG is treated as two graphs G₁=<V, E₁> and G₂=<V, E₂>. G₁ and G₂ are fed into graph embedding network 413 to generate the embeddings e₁ and e₂ respectively. Finally, the embedding of the BACFG is calculated as (e₁+e₂)/2.
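As a minimal sketch of the three-matrix storage described above, the following Python builds a node matrix, an adjacency matrix, and its transpose for the reversed edges, and averages two embeddings as (e₁+e₂)/2. The helper names `build_bacfg` and `average_embeddings`, the attribute values, and the edge list are hypothetical and chosen only for illustration; they are not taken from FIG. 2 or FIG. 3.

```python
def build_bacfg(node_attrs, edges):
    """Return (node_matrix, adj, rev_adj) for one function.

    node_attrs: one attribute row per basic block (row i describes node i).
    edges: (src, dst) index pairs for the control-flow edges E1.
    """
    n = len(node_attrs)
    node_matrix = [list(row) for row in node_attrs]
    adj = [[0] * n for _ in range(n)]
    for src, dst in edges:
        adj[src][dst] = 1
    # E2 is the transpose of E1: every edge with its direction reversed.
    rev_adj = [[adj[j][i] for j in range(n)] for i in range(n)]
    return node_matrix, adj, rev_adj


def average_embeddings(e1, e2):
    """Combine the embeddings of G1 and G2 as (e1 + e2) / 2."""
    return [(a + b) / 2 for a, b in zip(e1, e2)]


# Hypothetical four-block function: made-up attributes and edges.
example_nodes = [[1, 0], [2, 1], [0, 3], [1, 1]]
example_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
node_matrix, adj, rev_adj = build_bacfg(example_nodes, example_edges)
```

Storing the reversed edges as an explicit matrix, rather than transposing on the fly, keeps the two graphs G₁ and G₂ in the same format so that the same embedding network can consume either one.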

Graph embedding network 413 learns high-level feature representations from the control flow graphs (in short, code graphs) and encodes (i.e., embeds) the graphs into embeddings (i.e., high dimensional numerical vectors). It can be implemented in many ways. This invention presents an implementation based on a neural network adapted from Structure2Vec.

Denote a code graph as g=(V,E), where V and E are the sets of vertices and edges respectively; furthermore, each vertex in the graph may have additional features x_(v) which correspond to block-level features in a code graph. The graph embedding network first computes a p-dimensional feature μ_(v) for each vertex v∈V, and then the embedding vector μ_(g) of g is computed as an aggregation of the vertex embeddings.

More specifically, we denote N(v) as the set of neighbors of node v in graph g. Then one variant of the structure2vec network initializes the embedding μ_(v) ⁽⁰⁾ at each node to 0, and updates the embeddings at each iteration as

$\begin{matrix}{\mu_{v}^{(t+1)} = F\left(x_{v}, \sum\limits_{u \in N(v)} \mu_{u}^{(t)}\right), \quad \forall v \in V.} & (1)\end{matrix}$

In this fixed-point update formula, F is a generic nonlinear mapping.

$\begin{matrix}{F\left(x_{v}, \sum\limits_{u \in N(v)} \mu_{u}\right) = \tanh\left(W_{1} x_{v} + \sigma\left(\sum\limits_{u \in N(v)} \mu_{u}\right)\right)} & (2)\end{matrix}$

where x_(v) is a d-dimensional vector for graph node (or basic-block) level features, W₁ is a d×p matrix, and p is the embedding size as explained above.
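The iteration in equations (1) and (2) can be sketched in plain Python as follows. Several points are assumptions made only for this sketch, since the text above leaves them open: σ is taken to be a single linear map with a hypothetical p×p matrix `P`, node u is treated as a neighbor of v when `adj[v][u]` is 1, the graph embedding μ_(g) is aggregated as the sum of the vertex embeddings, and the function names and the default iteration count are illustrative.

```python
import math


def row_times_matrix(x, W):
    """Multiply row vector x (length r) by matrix W (r x c)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def embed_graph(node_matrix, adj, W1, P, iterations=3):
    """Run the update of equation (1) and aggregate the vertex embeddings.

    node_matrix: n rows of d-dimensional block features x_v.
    adj: n x n adjacency matrix; adj[v][u] == 1 marks u as a neighbor of v
         (assumed edge direction).
    W1: d x p weight matrix. P: p x p matrix standing in for sigma (assumed).
    """
    n, p = len(node_matrix), len(W1[0])
    mu = [[0.0] * p for _ in range(n)]  # mu_v^(0) = 0 for every node
    for _ in range(iterations):
        new_mu = []
        for v in range(n):
            # Sum of the current embeddings of the neighbors of v.
            agg = [sum(adj[v][u] * mu[u][k] for u in range(n)) for k in range(p)]
            wx = row_times_matrix(node_matrix[v], W1)  # W1 applied to x_v
            s = row_times_matrix(agg, P)               # sigma(...), assumed linear
            new_mu.append([math.tanh(a + b) for a, b in zip(wx, s)])
        mu = new_mu  # all nodes are updated from the previous iteration's values
    # Aggregate vertex embeddings into the graph embedding (sum is an assumption).
    return [sum(mu[v][k] for v in range(n)) for k in range(p)]
```

Under these assumptions, the BACFG embedding would be obtained by calling `embed_graph` once with the edge matrix and once with the reversed edge matrix, then averaging the two results.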

The parameters W₁ are trained using the Siamese architecture. The Siamese architecture uses two identical graph embedding networks which join at the top. Each graph embedding network takes one code graph ƒ_(i) (i=1, 2) as its input and outputs the embedding ϕ(ƒ_(i)). The final output of the Siamese architecture is the cosine similarity of the two embeddings. Furthermore, we require code graphs of the same function compiled for different platforms or at different optimization levels to have similarity 1, while code graphs compiled from different functions have similarity −1.

High Throughput Embedding Generation 105

Generating the embedding of a single function (represented as a code graph) is fast. However, a huge volume of executable files needs to be processed every day, and generating embeddings one function at a time is too slow to handle such a large volume of code. To solve this problem, this invention presents an implementation of high-throughput embedding generation 105 that stacks the BACFGs of functions.

Given n BACFGs {G₁, G₂, . . . , G_(n)} extracted from executable code, each represented in the format illustrated in FIG. 3, this implementation stacks these n BACFGs as illustrated in FIG. 4 to generate embeddings in batch. Each BACFG is denoted as G_(i)=<V_(i), E_(1i), E_(2i), φ>.

Stacked Node Matrix M1 401: Each row M1[i] represents one node as depicted in 301. The nodes of the n BACFGs are stacked in their original order and form the new node matrix M1.

Stacked Edge Matrix M2 405: As illustrated in 405, the edge matrices are stacked along the main diagonal. This can also be implemented by stacking along the anti-diagonal.

Stacked Reversed Edge Matrix M3 409: As illustrated in 409, the reversed edge matrices are stacked along the main diagonal. This can also be implemented by stacking along the anti-diagonal.

The stacked BACFGs are then fed into the graph embedding network 413, and the embeddings 414, 415, 416 of the functions are calculated simultaneously. This approach improves the embedding generation performance by a factor of nearly n.
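The stacking of M1, M2, and M3 can be illustrated with the short Python sketch below, which places each function's edge and reversed edge matrices as blocks along the main diagonal. The function name `stack_bacfgs` and the returned `offsets` list (the row range occupied by each function, which a caller could use to separate the per-function embeddings afterwards) are assumptions of this sketch and are not described in FIG. 4.

```python
def stack_bacfgs(graphs):
    """Stack BACFGs given as (node_matrix, adj, rev_adj) triples.

    Returns (M1, M2, M3, offsets). M1 concatenates the node rows in order;
    M2 and M3 hold each graph's edge and reversed edge matrices as blocks
    on the main diagonal, so no edge connects nodes of different functions.
    """
    total = sum(len(nodes) for nodes, _, _ in graphs)
    m1 = []
    m2 = [[0] * total for _ in range(total)]
    m3 = [[0] * total for _ in range(total)]
    offsets = []  # hypothetical bookkeeping: (start_row, end_row) per graph
    start = 0
    for nodes, adj, rev in graphs:
        n = len(nodes)
        offsets.append((start, start + n))
        m1.extend(list(row) for row in nodes)
        for i in range(n):
            for j in range(n):
                m2[start + i][start + j] = adj[i][j]
                m3[start + i][start + j] = rev[i][j]
        start += n
    return m1, m2, m3, offsets
```

Because the off-diagonal blocks are zero, the neighbor sums in the embedding network never mix nodes from different functions, which is why one pass over the stacked matrices can produce all n embeddings at once.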

High-Throughput Similarity Calculation 107

The embedding of a function is a numeric vector. The distance between two numeric vectors can be used to calculate the similarity of functions. There are many algorithms to calculate the distance between vectors. This invention applies cosine similarity to calculate the distance. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

$\begin{matrix}{\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum\limits_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum\limits_{i=1}^{n} A_{i}^{2}}\,\sqrt{\sum\limits_{i=1}^{n} B_{i}^{2}}}} & (3)\end{matrix}$

Calculating the similarity of two functions is fast. In practice, however, the similarity comparison is usually applied to thousands or millions of functions. For example, one common task is to find vulnerable functions in docker images. The executable code size is usually over 100 MB, which corresponds to approximately 102,400 functions if we assume the size per function is 1K (underestimated). Searching for one vulnerable function therefore requires 102,400 comparisons, and that is for only one docker image and one vulnerable function. In reality, there are thousands of vulnerable functions and docker images. Comparing functions one pair at a time is therefore not practical.

To make this approach practical, this invention presents a new way to calculate the similarity of embeddings in batch. As illustrated in FIG. 5, n embeddings 501 of dimension (1×m) are stacked together to generate an (n×m) matrix A 502. Each embedding E_(ƒi) can be retrieved via A[i].

Given two groups of embeddings {E_(ƒa1), E_(ƒa2), . . . , E_(ƒan)} and {E_(ƒb1), E_(ƒb2), . . . , E_(ƒbn)}, this invention first stacks these embeddings as illustrated in FIG. 5 into two matrices A and B. The similarity of the embeddings is calculated as

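$\begin{matrix}{S = \frac{AB^{T}}{\|A\|\,\|B\|^{T}}} & (4)\end{matrix}$

where ‖A‖ and ‖B‖ denote the column vectors of the row-wise Euclidean norms of A and B, the division is element-wise, and S[i][k] is the cosine similarity of embedding A[i] and embedding B[k].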
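To make the shape of equation (4) concrete, the sketch below computes the full similarity matrix S from two stacked embedding matrices using only the Python standard library. The function name `cosine_similarity_matrix` is hypothetical, and the sketch assumes every embedding has a nonzero norm. It mirrors the formula rather than its performance: a numerical array library would evaluate AB^(T) as a single matrix product, which is where the throughput gain comes from.

```python
import math


def cosine_similarity_matrix(A, B):
    """Return S with S[i][k] = cosine similarity of row A[i] and row B[k].

    A is n x m and B is q x m; all rows are assumed to be nonzero vectors.
    """
    # Row-wise Euclidean norms, computed once per matrix.
    norm_a = [math.sqrt(sum(x * x for x in row)) for row in A]
    norm_b = [math.sqrt(sum(x * x for x in row)) for row in B]
    return [
        [
            # Entry (i, k) of A B^T divided by the product of the two norms.
            sum(x * y for x, y in zip(a, b)) / (na * nb)
            for b, nb in zip(B, norm_b)
        ]
        for a, na in zip(A, norm_a)
    ]
```

Computing the norms once per row, instead of once per pair, already removes most of the repeated work of the pairwise approach of equation (3).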

The above approach greatly improves the similarity comparison performance. In testing, it takes only several seconds to conduct 100 million comparisons on a desktop PC with an Intel i7-4790.

Application 1: Executable File Similarity Comparison

One application of the high-throughput embedding generation 105 and high-throughput similarity calculation 107 is executable file similarity comparison. The applications of executable file similarity comparison include but are not limited to malware classification, provenance, etc.

This invention presents an approach that uses the embeddings of an executable file to compare similarity. FIG. 6 illustrates the workflow. The input executable file 601 is disassembled to extract the BACFGs 602 of every function using the approach described in 103. Then the high-throughput embedding generation 603 described in FIG. 4 is used to generate the embeddings {E_(ƒ1), E_(ƒ2), . . . , E_(ƒn)} for the n functions extracted from the executable file. Finally, this invention applies Principal Component Analysis (PCA) 605 on the function embeddings {E_(ƒ1), E_(ƒ2), . . . , E_(ƒn)} to generate the embedding of the executable file 606 by reducing the function embeddings to a fixed-length vector.

The embedding of the executable file is still a numeric vector. The distance between these numeric vectors can be used to calculate the similarity of executable files. This invention applies the high-throughput similarity calculation 107 discussed above to calculate the similarity of different executable files.
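The text does not spell out how PCA yields the fixed-length vector, so the following is only one possible reading, sketched under stated assumptions: the n function embeddings are centered, and the first principal axis of the resulting n×m matrix (a unit vector of length m, independent of n) is used as the file-level vector. The function name `first_principal_axis`, the use of power iteration, the starting vector, and the sign convention are all choices of this sketch rather than details from FIG. 6.

```python
import math


def first_principal_axis(embeddings, iterations=100):
    """Return the first principal axis (length m) of n embeddings of length m."""
    n, m = len(embeddings), len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in embeddings]
    # Unnormalized covariance matrix (m x m); scaling does not change the axis.
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(m)] for i in range(m)]
    # Fixed, non-uniform start vector so the result is deterministic.
    v = [1.0 / (j + 1) for j in range(m)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        # Power iteration: repeatedly apply cov and renormalize.
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero covariance (e.g. identical rows): keep the current v
        v = [x / norm for x in w]
    # Resolve the sign ambiguity: make the largest-magnitude entry positive.
    pivot = max(range(m), key=lambda j: abs(v[j]))
    if v[pivot] < 0:
        v = [-x for x in v]
    return v
```

A principal axis is only defined up to sign, so some fixed convention such as the one above is needed before two file-level vectors are compared by cosine similarity.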

Application 2: Known Vulnerability Search

Vulnerability detection is getting harder as code size and the number of third-party libraries used increase, especially when code is statically linked. A general approach is to treat a vulnerability as one or more vulnerable functions. The problem of vulnerability search is then converted into the problem of searching for semantically equivalent functions in binary code.

The embodiment described in FIG. 7 presents an efficient known vulnerability search solution by applying the high-throughput embedding generation 105 and similarity calculation 107. The goal is to find vulnerable functions 701 in executable files 702. The basic idea is first to quickly obtain a short list of potentially vulnerable functions using high-throughput embedding comparison. Then a conditional formula based function identification 710 is applied to the candidate list to find the vulnerable functions 711.

The executable files 701 and vulnerable functions are disassembled to extract the BACFGs 702 of every function using the approach described in 103. Then the high-throughput embedding generation 705 described in FIG. 4 is used to generate the embeddings {E_(ƒ1), E_(ƒ2), . . . , E_(ƒn)} for the functions extracted from the executable files and {E_(vƒ1), E_(vƒ2), . . . , E_(vƒn)} for the vulnerable functions. The high-throughput similarity calculation 798 embodiment compares {E_(vƒ1), E_(vƒ2), . . . , E_(vƒn)} with {E_(ƒ1), E_(ƒ2), . . . , E_(ƒn)} to obtain the similarity score between every vulnerable function and the functions from the executable files. This similarity score is then used to generate a vulnerable function candidate list [ƒ1, ƒ2, . . . , ƒk] from the functions extracted from the executable files by selecting the top k most similar functions.
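The top-k selection step can be sketched as follows, assuming the similarity scores of one vulnerable function against all extracted functions are available as a list (for example, one row of the similarity matrix S). The function name `top_k_candidates` is hypothetical.

```python
import heapq


def top_k_candidates(similarity_row, k):
    """Return the indices of the k most similar functions, best match first.

    similarity_row[i] is the similarity between one vulnerable function and
    the i-th function extracted from the executable files.
    """
    return heapq.nlargest(
        k, range(len(similarity_row)), key=lambda i: similarity_row[i]
    )


# Hypothetical scores for four extracted functions.
candidates = top_k_candidates([0.1, 0.9, 0.5, 0.7], 2)
```

Using a heap-based selection avoids sorting the full score list, which matters when each vulnerable function is compared against a very large number of extracted functions and only a small k is kept.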

Since k is usually a small number (<20), expensive program analysis can be applied to determine exactly whether the functions in the candidate list are indeed vulnerable. The conditional formula based function identification embodiment 710 is implemented to identify the truly vulnerable functions in the candidate list.

Generally speaking, a conditional formula consists of an If-clause and a Then-clause, and each clause is a symbolic formula, describing under what condition (stated in the If-clause) a given action (in the Then-clause) will take place. A conditional formula explicitly captures two cardinal factors of buggy code: (1) erroneous data dependencies, and (2) missing or incorrect condition checks. Instead of treating the vulnerable function as a whole, searching on structured conditional formulas can effectively localize the possibly vulnerable code logic. By contrasting conditional formulas between the vulnerable function and a target candidate, we can quickly diagnose whether the target is vulnerable or a false positive.

The embodiment 710 first utilizes a binary lifting tool (such as Binary Ninja) to convert the vulnerable functions and the candidate list to the same higher-level intermediate representation (IR). Then it applies program analysis techniques on the IR to construct conditional formulas for every vulnerable function and candidate. Data dependencies via pointers are carefully handled. In addition, not all the variables in a function are of interest, so action point selection is conducted to filter out irrelevant variables. Then embodiment 09 matches functions by their unified conditional formulas. It can be implemented in many ways, such as with a constraint solver. Finally, the identified vulnerable functions 711 can be generated by removing the false positive candidates from [ƒ1, ƒ2, . . . , ƒk].

What is claimed is:
 1. A system for high-throughput embedding generation and comparison, comprising: take executable code and extract Bi-directional ACFGs of every function from it; conduct high-throughput embedding generation for the Bi-directional ACFG; conduct high-throughput similarity comparison of functions using the embeddings; compare the similarity of executable files by applying Principal Component Analysis on embeddings of functions; search the vulnerability by combining high-throughput embedding generation and comparison with condition formula comparison.
 2. The system of claim 1, further comprising the definition of Bi-directional ACFG.
 3. The system of claim 1, further comprising the high-throughput embedding generation which deploys stacked Bi-directional ACFGs to maximize the throughput of the embedding network.
 4. The system of claim 1, further comprising the usage of bi-directional ACFG as the input of graph embedding network, which improves the accuracy of the invention.
 5. The system of claim 1, further comprising the high-throughput similarity calculation which deploys matrix manipulation to maximize the throughput of the system.
 6. The system of claim 5, wherein the matrix manipulation is implemented by stacking function embedding vectors into matrix format, and processing in batches through one calculation to provide high speed cosine similarity calculation.
 7. The system of claim 1, wherein an executable file similarity comparison system is implemented using high-throughput embedding generation and comparison system.
 8. The system of claim 7, wherein principal component analysis is conducted on embeddings of functions extracted from executable to generate the embedding of executable file.
 9. The system of claim 7, wherein the cosine similarity of executable files' embeddings is used to calculate the similarity of executable files.
 10. The system of claim 1, wherein a vulnerability search system is implemented using high-throughput embedding generation and comparison system.
 11. The system of claim 10, wherein high-throughput embedding generation and comparison system is used to identify the candidates list of vulnerable functions.
 12. The system of claim 10, wherein condition formula comparison is used to identify the true positive vulnerable functions in the candidates list.